Applications of Latent Trait Theory to the Development
and Use of Criterion-Referenced Tests^1


Ronald K. Hambleton
University of Massachusetts, Amherst

The success of competency based education will depend to a considerable
extent upon how effectively criterion-referenced tests are constructed,
and how the test scores are used (1) to assess examinee performance
levels and (2) to make mastery/non-mastery decisions. It is common to
define a criterion-referenced test as a test which is designed to provide
examinee data relative to well-defined objectives being measured by a
test (Popham, 1978). "Well-defined" means that each objective is stated
in such a way that the relevant pool of possible test items measuring an
objective is clear to anyone who makes use of the test scores or who
becomes involved in the test development process (for example, item
writers and item reviewers).

Up until about five years ago there was a considerable amount of energy
being expended in the development of criterion-referenced tests and in
the use of criterion-referenced test scores. However, the potential of
these criterion-referenced testing programs was not often realized,
either because of poorly constructed tests or misinterpreted test scores,
or both. Undoubtedly such a state of affairs existed because of the
shortage of technical guidelines to aid both test developers and test
score users. Often the test items did not measure the intended
objectives, too few test items were used in the tests, performance
standards were set without due consideration of the relevant issues or
without using proper methods, and so on.

^1 The project was performed pursuant to a contract from the United
States Air Force Office of Scientific Research. However, the opinions
expressed here do not necessarily reflect their position or policy, and
no official endorsement by the Air Force should be inferred.

Laboratory of Psychometric and Evaluative Research Report No. 91.
Amherst, MA: School of Education, University of Massachusetts, 1979.

A paper presented at an AERA-NCME symposium entitled "Psychometric
Approaches to Domain-Referenced Testing," San Francisco, April 1979.

Fortunately, there is no reason for the problems to exist anymore.
There have been a large number of very useful contributions to a
criterion-referenced testing technology and you have heard about many of
these from the other presenters at this symposium (Brennan, Huynh,
Subkoviak). Such contributions are making it possible to develop better
criterion-referenced tests and to use the scores in more appropriate ways
(Popham, 1978; Hambleton, Swaminathan, Algina, & Coulson, 1978). For
example, much is known about steps for developing criterion-referenced
tests, assessing content validity, assembling tests, setting performance
standards and assessing test reliability.

Before I lull the reader into a state of euphoria, let me be quick to
point out that many very important problems remain. For one, what are the
best methods for obtaining more accurate estimates of examinees' domain
scores (level of performance scores relative to each objective being
tested) and for decreasing the frequency of times examinees are
misclassified (assigned to "non-mastery" states when they are "masters"
and assigned to "mastery" states when they are "non-masters")?

Student mastery of objectives in a unit or module is often determined by
an administration of a criterion-referenced test. "Mastery" is inferred
when a student's test performance on a set of items measuring an
objective exceeds some minimum performance level. The minimum performance
level for mastery is often referred to as a cutting score or passing
score.
In theory, criterion-referenced test scores can be made as reliable and
valid as necessary by adding additional test items. Unfortunately, making
a mastery/non-mastery decision on each of the objectives measured by a
criterion-referenced test often requires a considerable amount of testing
time. Therefore, it is usually impractical to consider lengthening tests,
particularly to the length that would often be necessary to accomplish
some desired goal for reliability and validity of test scores.

Some  critics  have  argued  there  is  already  too  much  criterion- 
referenced  testing  for  instructional  and  program  evaluation  purposes.  On 
the  other  hand,  some  increase  in  testing  time  can  be  defended  on  the 
grounds  that  test  response  data  is  closely  tied  to  the  objectives  defining 
a  curriculum  and  that  the  data  are  used  to  monitor  student  progress. 
Nevertheless,  it  seems  clear  that  research  is  needed  on  procedures  offering 
potential  for  reducing  testing  time  without  reducing  the  quality  of 
decision-making  from  test  score  results. 

The use of Bayesian statistical procedures represents one promising
method for reducing testing time and/or improving the quality of mastery
decisions (Hambleton & Novick, 1973; Novick & Jackson, 1974; Swaminathan,
Hambleton, & Algina, 1975). This method is particularly appealing because
it requires no change from the most common methods of test
administration. Improvements in decision making are attributable to the
utilization of information ignored by non-Bayesian procedures. Bayesian
procedures may use not only the direct information provided by an
examinee's test score, but they also make use of collateral information
contained in the data of other examinees and of prior information on
other relevant data that are available on the examinee (e.g., test scores
from other [illegible]).
In one simulation study Hambleton, Hutton, and Swaminathan (1976)
compared several Bayesian estimation procedures with several classical
procedures for assessing student mastery and making instructional
decisions. They reported modest gains from use of the Bayesian estimation
procedures. On the negative side, Bayesian statistical procedures are
based on restrictive assumptions, and robustness of the procedures has
not been studied extensively. Also, some individuals feel that the
utilization of group information to influence individual mastery
estimates is a contradiction of one of the fundamental postulates of
objectives-based instruction, that is, each student is judged on his/her
own merits; thus, mastery decisions should not depend on the performance
of other students.

There is a second solution to the problem sketched out earlier (and other
testing problems). This solution involves the use of latent trait theory
(Lord & Novick, 1968; Hambleton, Swaminathan, Cook, Eignor, & Gifford,
1979). Considerable research has been done with latent trait models and
concepts and many applications to testing have been highly successful,
but relatively little specific attention has been given to
criterion-referenced testing problems. Specific attention is important
because norm-referenced tests and criterion-referenced tests are
constructed and analyzed, and their scores interpreted, in fundamentally
different ways (norm-referenced tests are constructed to facilitate
comparing one person with another on the ability measured by a test;
criterion-referenced tests are constructed to determine examinee level of
performance relative to the objectives measured by the test). Therefore,
latent trait theoretic results which apply to norm-referenced tests will
not necessarily apply to criterion-referenced tests. Unfortunately, much
of the research and development work has been done with respect to
norm-referenced tests (see, for example, work by Hambleton et al., 1979;
Lord, in press; Weiss, 1978).
Purposes of the Study

Two important technologies have emerged in the last ten years which have
considerable potential for improving the assessment of individuals. The
first, criterion-referenced testing technology, is the better known of
the two, and is being used throughout the country in a variety of ways
(for example, screening of students, monitoring student progress in
courses, assigning student grades, and licensing and certification).
Nevertheless, many technological problems remain and therefore these new
criterion-referenced testing programs are not achieving their full
potential. The second, latent trait theoretic technology, has developed
more slowly, but is now being used in many types of testing programs. A
cursory glance at the 1979 AERA and NCME annual meeting program will
quickly substantiate the extensive use of latent trait models. There is
one notable exception. It is not being used in any extensive way with
criterion-referenced tests. This is unfortunate because latent trait
models and concepts have led to many important norm-referenced test
developments (see, for example, Hambleton & Cook, 1977), and appear to
have the capability of resolving some of the technological problems
associated with the construction and uses of criterion-referenced tests.

The goal of this paper is to consider latent trait theory as a framework
for resolving some of the technical problems associated with
criterion-referenced tests. Specifically, (1) a brief introduction to
latent trait models and concepts is offered, (2) features of latent trait
models which have special relevance to criterion-referenced testing are
considered, (3) several applications of latent trait models are
introduced, and (4) conclusions and suggestions for further research are
provided. The four sections of the paper correspond to the four specific
purposes outlined above.
Brief Introduction to Latent Trait Models and Concepts

A theory of latent traits supposes that, in testing situations, examinee
performance on a test can be predicted (or explained) by examinee
characteristics, referred to as traits. Scores for examinees on these
traits are estimated and used to predict or explain test performance
(Lord and Novick, 1968). Since the traits are not directly measurable,
they are referred to as latent traits or abilities. A latent trait model
specifies a relationship between the observable examinee test performance
and the unobservable traits or abilities assumed to underlie performance
on the test. The relationship between the "observable" and the
"unobservable" quantities is described by a mathematical function. The
concept of a "latent trait," and a "domain score" in the context of
criterion-referenced measurement are the same. The relationship is an
algebraic one and is specified by the "test characteristic curve," a term
which will be defined later.

When selecting a particular latent trait model to apply to one's test
data, it is necessary to consider whether the data satisfy the
assumptions of the model. If they do not, different test models should be
considered. Alternately, some psychometricians (for example, Wright,
1968) have recommended that test developers design their tests so as to
satisfy the assumptions of the particular latent trait model they are
interested in using. In this way, the advantages of the particular latent
trait model of interest can be utilized.

The three fundamental assumptions underlying the most commonly used
latent trait models are: the unidimensionality of the test items, local
independence, and the mathematical form of the item characteristic
curves. Each of these assumptions will be discussed briefly. Two other
important topics will also be considered: item and test information
curves, and efficiency.

The  Assumption  of  Unidimensionality 

The assumption of a unidimensional set of test items is a common one for
test constructors, since they usually desire to construct unidimensional
tests so as to enhance the interpretability of a set of test scores
(Lumsden, 1976). This is certainly the case with criterion-referenced
tests since a key characteristic of a good criterion-referenced test is
the interpretability of scores derived from the test.

Lumsden (1961) provided an excellent review of methods for constructing
unidimensional tests. He concluded that the method of factor analysis
held the most promise. Fifteen years later he reaffirmed his conviction
(Lumsden, 1976). Essentially, Lumsden recommends that a test constructor
generate an initial pool of test items selected on the basis of empirical
evidence and a priori grounds. In the jargon of criterion-referenced
measurement, items are written to match domain specifications and are
discarded when it can be determined that they are invalid for their
intended purposes. Such an item selection procedure will increase the
likelihood that a unidimensional set of test items within the pool of
items can be found. If test items are not preselected, the pool may be
too heterogeneous for the unidimensional set of items in the item pool to
emerge. In Lumsden's method, a factor analysis is performed and items
that are not measuring the dominant factor obtained in the factor
solution are removed. The remaining items are factor analyzed, and again,
"deviant" items are removed. The process is repeated until a satisfactory
solution is obtained. Convergence is most likely when the initial item
pool is carefully selected to include only items that appear to be
measuring a common trait. Lumsden proposed that the ratio of first factor
variance to second factor variance be used as an "index of
unidimensionality." Rejected test items should be studied to determine
the possible basis for their misfit. In some instances, it may be
necessary to rewrite the domain specifications to reflect the test items
which remain.

Local Independence

The second assumption is that of local independence. The assumption
states that the test item responses of a given examinee are statistically
independent. This means that an examinee's performance on one item does
not affect his or her performance on other items in the test. This result
would be obtained if the test items measured a single ability.

Item Characteristic Curves

An item characteristic curve is a mathematical function that relates the
probability of success on an item to the ability measured by the set of
items contained in the test. There is no concept comparable to the notion
of an item characteristic curve in standard test technology. A primary
distinction among different latent trait models is in the mathematical
form of the corresponding item characteristic curves. It is up to the
user to choose one of the many mathematical forms for the shape of the
item characteristic curves. In doing so, an assumption about the items is
being made which can be verified later by how well the chosen model
"explains" obtained test results.
Each item characteristic curve for a particular latent trait model is a
member of a family of curves of the same general form. The number of
parameters required to describe an item characteristic curve will depend
on the particular latent trait model.

The mathematical expression for the three-parameter logistic curve is:

\[
P_g(\theta) = c_g + (1 - c_g)\,
  \frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}},
  \qquad g = 1, 2, \ldots, n,
\]

where:

$P_g(\theta)$ = the probability that an examinee with ability level
$\theta$ answers item g correctly,

$b_g$ = the item difficulty parameter,

$a_g$ = the item discrimination parameter,

and

$D$ = 1.7 (a scaling factor).

The parameter $c_g$ is the lower asymptote of the item characteristic
curve and represents the probability of examinees with low ability
correctly answering an item. The parameter $c_g$ is included in the model
to account for test response data at the low end of the ability
continuum, where, among other things, guessing is a factor in test
performance. It is now common to refer to the parameter $c_g$ as the
pseudo-chance level parameter in the model.

Typically, $c_g$ assumes values that are smaller than the value that
would result if examinees of low ability were to guess randomly on the
item. As Lord (1974) has noted, this phenomenon can probably be
attributed to the ingenuity of item writers in developing "attractive"
but incorrect choices. For this reason, avoidance of the label "guessing
parameter" to describe the parameter $c_g$ is desirable.
The popular "Rasch model" or one-parameter logistic test model can be
obtained from the three-parameter logistic model by making two
assumptions about the test data: (1) the amount of guessing is minimal,
and (2) items included in a test are equally discriminating.

Item characteristic curves for several latent trait models are presented
in Figure 1.
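The three-parameter logistic curve and its one-parameter special case can be sketched in a few lines of Python. This is only an illustration of the formula given above, not code from the paper; the function names are hypothetical, and fixing the common discrimination at 1.0 in the one-parameter case is an assumption made here for convenience.

```python
import math

D = 1.7  # scaling factor given in the text


def icc_3pl(theta, a, b, c=0.0):
    """Three-parameter logistic item characteristic curve.

    theta: ability level; a: discrimination; b: difficulty;
    c: pseudo-chance level (lower asymptote).
    """
    z = D * a * (theta - b)
    # e^z / (1 + e^z) written in the equivalent form 1 / (1 + e^-z)
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def icc_rasch(theta, b):
    """One-parameter case: minimal guessing (c = 0) and equal discrimination.

    The common discrimination value 1.0 is an assumption of this sketch.
    """
    return icc_3pl(theta, a=1.0, b=b, c=0.0)


if __name__ == "__main__":
    for theta in (-2.0, 0.0, 2.0):
        print(theta, icc_3pl(theta, a=1.0, b=0.0, c=0.2), icc_rasch(theta, b=0.0))
```

At an ability level equal to the item difficulty the three-parameter curve takes the value c + (1 - c)/2, which is why the curve for an item with a nonzero pseudo-chance level sits above 0.5 at that point.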


Item  and  Test  Information  Curves 


Once a latent trait model is specified, the precision with which it
estimates examinee ability can be determined. Birnbaum (1968) defined the
notion of information as a quantity inversely proportional to the squared
length of the confidence interval around an estimate of an examinee's
ability. The standard error of estimate of ability is equal to
$1/\sqrt{\text{Information}}$. When information at an ability level is
high, we have narrow confidence bands around our estimates; if
information is low, we have wider confidence bands. Because the
information function varies with ability level, it has been suggested
that test information curves ought to replace the use of classical
reliability estimates and standard errors of measurement in test score
interpretations.

In mathematical terms, Birnbaum (1968) gives the information curve of a
given scoring formula by

\[
I(\theta, y) =
  \frac{\left[ \sum_{g=1}^{n} w_g P'_g \right]^2}
       {\sum_{g=1}^{n} w_g^2 P_g Q_g} .
\]

In the expression above, $I(\theta, y)$ is the amount of information at
ability level $\theta$ provided by the scoring formula $y$, where

\[
y = \sum_{g=1}^{n} w_g X_g ;
\]

the variable $X_g$ is 0 or 1 depending on whether or not item g is
answered correctly; $P_g$ is the probability of a correct answer to item
g by an examinee with ability level $\theta$; $Q_g$ is equal to
$1 - P_g$; $P'_g$ is the slope of the item characteristic curve at
ability level $\theta$; and the item scoring weights are $w_g$,
g = 1, 2, ..., n.


Birnbaum (1968) has shown that the maximum value of $I(\theta, y)$,
referred to as the test information curve, is given by

\[
I(\theta) = \sum_{g=1}^{n} \frac{P_g'^{\,2}}{P_g Q_g} . \qquad [1]
\]

The maximum value of the information curve of a given scoring formula is
obtained when the scoring weights are chosen such that

\[
w_g = \frac{P'_g}{P_g Q_g} .
\]

The quantity $P_g'^{\,2} / P_g Q_g$ is the contribution of item g to the
information function of the test and is referred to as the item
information function. Item information functions have an important role
in determining the accuracy with which ability is estimated at different
levels of $\theta$. Each item information curve depends on the slope of
the particular item characteristic curve and the conditional variance of
test scores at each ability level $\theta$. The higher the slope of the
item characteristic curve and the smaller the conditional variance, the
higher will be the item information curve at that particular ability
level. The height of the item information curve at a particular ability
level is a direct measure of the usefulness of the item for precisely
measuring ability at that level.
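The item information function, its sum over items, and the resulting standard error of estimate can be computed directly from the three-parameter logistic curve. The following Python sketch is illustrative only: the function names and the (a, b, c) tuple layout for items are assumptions introduced here, and the slope is the analytic derivative of the logistic curve with respect to ability.

```python
import math

D = 1.7  # scaling factor given in the text


def icc_3pl(theta, a, b, c):
    """Probability of a correct answer under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def icc_slope(theta, a, b, c):
    """Slope P' of the item characteristic curve at ability theta."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (1.0 - c) * D * a * logistic * (1.0 - logistic)


def item_information(theta, a, b, c):
    """Item information: squared slope divided by P * Q."""
    p = icc_3pl(theta, a, b, c)
    return icc_slope(theta, a, b, c) ** 2 / (p * (1.0 - p))


def total_information(theta, items):
    """Test information: sum of the item information ordinates.

    items is a list of (a, b, c) tuples (a layout assumed for this sketch).
    """
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def standard_error(theta, items):
    """Standard error of the ability estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))
```

With no guessing and unit discrimination, an item evaluated at its own difficulty has information D squared over 4; raising the pseudo-chance level lowers the value, in line with the discussion of the figures.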
The information function for the test composed of the items is obtained
by summing the ordinates of the item information curves. From Equation
[1] it is clear that items contribute independently to the test
information function. Birnbaum (1968) has also shown that with his
three-parameter model, an item provides maximum information at an ability
level $\theta$, where

\[
\theta = b_g + \frac{1}{D a_g}
  \log_e \left[ \tfrac{1}{2} \left( 1 + \sqrt{1 + 8 c_g} \right) \right] .
\]

If guessing is minimal, then $c_g = 0$, and $\theta = b_g$.
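The location of maximum item information can be checked numerically against the item information function. The Python sketch below is a hedged illustration: the function names are hypothetical, and the item information is written in an algebraically simplified closed form for the three-parameter logistic model rather than copied from the paper.

```python
import math

D = 1.7  # scaling factor given in the text


def theta_max_information(a, b, c):
    """Ability level at which a three-parameter item is most informative."""
    return b + math.log(0.5 * (1.0 + math.sqrt(1.0 + 8.0 * c))) / (D * a)


def item_information(theta, a, b, c):
    """Item information P'^2 / (P Q), simplified for the logistic form."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    return (D * a) ** 2 * (1.0 - c) * logistic ** 2 * (1.0 - logistic) / p


if __name__ == "__main__":
    a, b, c = 1.2, 0.0, 0.2  # illustrative item parameters, not from the paper
    peak = theta_max_information(a, b, c)
    # Information slightly to either side of the peak should not exceed the peak.
    print(peak, item_information(peak, a, b, c))
    print(item_information(peak - 0.05, a, b, c), item_information(peak + 0.05, a, b, c))
```

When c is zero the logarithm vanishes and the peak sits exactly at the item difficulty; any positive pseudo-chance level moves the peak above the difficulty.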


Figures 2 to 5 show ten item characteristic curves and corresponding item
information curves. The influence of the pseudo-chance level parameter is
clear from the figures: when $c_g > 0$, (1) the lower asymptote of the
item characteristic curve is different from zero, (2) less information is
obtained, and (3) the point of maximum information is shifted to a
somewhat higher ability level. Figure 8 shows the calculation of a test
information curve from five item information curves. In passing, perhaps
it should be noted that when item parameter estimates are used in place
of item parameters, test information curves are called "score information
curves" by Lord (in press).


Efficiency 

A concept closely related to test information is the concept of
efficiency. An efficiency curve is formed by calculating the ratio of two
information curves at different points on an ability continuum. The
efficiency curve provides a measure of the relative effectiveness of two
tests (each characterized by a different information curve) for measuring
ability. In test development work, it is common to compare the efficiency
of different test designs (i.e., tests composed of different items) for
measuring ability at different locations on the ability continuum.
Whereas the shapes of test information curves depend on the metric chosen
for measuring ability, efficiency curves do not, and therefore they are
particularly useful in test development work.

Figure 8. Information curves estimated for five items and a five-item
test. The items are from the verbal section of the SAT. This figure is
reproduced by permission from Lord (1968).
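An efficiency curve is simply a pointwise ratio of two test information curves. The Python sketch below illustrates the idea with two hypothetical five-item tests (one easy, one hard); the item parameters, function names, and the (a, b, c) tuple layout are all assumptions made for the example.

```python
import math

D = 1.7  # scaling factor given in the text


def item_information(theta, a, b, c):
    """Item information P'^2 / (P Q) for the three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = (1.0 - c) * D * a * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))


def total_information(theta, items):
    """Sum of item information ordinates for a list of (a, b, c) tuples."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def relative_efficiency(theta, items_first, items_second):
    """Ratio of the first test's information to the second's at ability theta."""
    return total_information(theta, items_first) / total_information(theta, items_second)


if __name__ == "__main__":
    easy_test = [(1.0, -1.0, 0.0)] * 5  # hypothetical easy items
    hard_test = [(1.0, 1.0, 0.0)] * 5   # hypothetical hard items
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, relative_efficiency(theta, easy_test, hard_test))
```

Values above 1 mark ability levels where the first test measures more precisely than the second; in this example the easy test is the more efficient one for low-ability examinees and the two tests are equally efficient midway between their difficulties.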

The  process  of  determining  the  relative  efficiency  of  two  tests 
is  employed  more  often  as  part  of  the  analysis  of  existing  tests  than 
as  a  part  of  the  test  development  process. 

The distinction between test analysis and test development has been made
by Rentz and Bashaw (197?). Basically they define the test development
process as one that allows items that do not fit the model to be
discarded, whereas in test analysis applications, "the model becomes
fixed and data are in effect 'fitted' to it." The distinction made by
Rentz and Bashaw between test development and test analysis is a useful
one.


The Ability Scale and
Test Characteristic Curves

If we were to administer two criterion-referenced tests that measured the
same objective (or objectives) to the same group of examinees, and the
tests were not strictly parallel, two different test score distributions
would result. The extent of the differences between the two distributions
would depend, among other things, on the difference between the
difficulties of the two tests. Unfortunately, there is no basis for
preferring one distribution over the other. What this example reveals is
that, in general, the test score distribution provides no information
about the distribution of ability scores.
The problem occurs because the raw-score units on each test are unequal
and different. On the other hand, the scale on which ability scores are
measured is one on which examinees will have the same ability score
across non-parallel tests measuring a common ability. Thus, even though
an examinee's test scores will vary across non-parallel forms of a test
measuring an ability, the expected ability score for an examinee will be
the same on each form.

Most measurement specialists are familiar with the concept of domain
score, the expected test score (on a sample of test items) for an
examinee. What is the relationship between domain scores and ability
scores? The test characteristic curve, which is obtained by summing the
ordinates of the ICC's, provides the relationship. This is easily seen
from the following argument. Consider the proportion-correct score,
$Z = X/n$. Then

\[
E(Z \mid \theta) = \frac{1}{n} \sum_{g=1}^{n} P_g(\theta), \qquad [2]
\]

\[
\mathrm{Var}(Z \mid \theta) = \frac{1}{n^2} \sum_{g=1}^{n}
  P_g(\theta)\, Q_g(\theta). \qquad [3]
\]

$E(Z \mid \theta)$ is the test characteristic curve (scaled by 1/n)
introduced earlier. It is the sum of item characteristic curves for items
included in the test. Suppose next we lengthen the test by adding an
infinite number of parallel forms. By definition,
$E(Z \mid \theta) = \pi$, the domain score. Also
$\mathrm{Var}(Z \mid \theta) \to 0$ as $n \to \infty$, and so $\pi$ and
$\theta$ will be related by a monotonic increasing transformation which
is the test characteristic curve. Clearly then, the two concepts, $\pi$
and $\theta$, are the same, except for the scale of measurement used to
describe each. One important difference is that domain score is defined
on the interval [0, n] or [0, 1] whereas ability scores are usually
defined on the interval $(-\infty, +\infty)$.
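Equations [2] and [3] translate directly into code. The Python sketch below evaluates the scaled test characteristic curve and the conditional variance of the proportion-correct score for a hypothetical item set; the function names, item parameters, and (a, b, c) tuple layout are assumptions made for illustration.

```python
import math

D = 1.7  # scaling factor given in the text


def icc_3pl(theta, a, b, c):
    """Probability of a correct answer under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_proportion_correct(theta, items):
    """Equation [2]: the test characteristic curve scaled by 1/n."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items) / len(items)


def conditional_variance(theta, items):
    """Equation [3]: variance of the proportion-correct score given theta."""
    n = len(items)
    total = 0.0
    for a, b, c in items:
        p = icc_3pl(theta, a, b, c)
        total += p * (1.0 - p)
    return total / (n * n)


if __name__ == "__main__":
    items = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2)]  # hypothetical
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, expected_proportion_correct(theta, items),
              conditional_variance(theta, items))
```

Because every item characteristic curve is increasing in ability, the scaled sum is increasing as well, which is what allows an ability score to be mapped onto the [0, 1] domain score scale. Appending parallel items leaves the expected proportion unchanged while shrinking the conditional variance.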
There are other differences between domain scores and ability scores. A
domain score is defined for each sample of test items. It is the expected
test score for an examinee. An examinee's domain score will vary across
non-parallel measures of the same ability. On the other hand, ability
score is defined for a "pool" or "universe" of items measuring a single
ability. An examinee's domain score in different samples of items would
(in general) vary. However, ability score is defined in terms of the
"pool" of items from which the sample was drawn. Latent trait models
specify relationships between examinee item performance and ability, and
so it is always possible to "transform" examinee performance on a
particular sample of items (defining a test) onto an ability scale
defined for the large "pool" of test items. Thus, while an examinee would
have (in general) a different domain score for each sample of items drawn
from the pool and would obtain different test scores in each sample of
items, the expected estimate of examinee ability from each sample of test
items would be the same. More will be said about this important
relationship later.

Ability scores can be used with item characteristic curve parameters for
items included in a test to estimate examinee test performance. Recall,

\[
E(X \mid \theta) = \sum_{g=1}^{n} P_g(\theta). \qquad [4]
\]

Thus, ability scores provide a basis for content-referenced
interpretations of examinee test scores. When the quantities in Equation
[4] are scaled by 1/n, $E(X/n \mid \theta)$ represents the expected
proportion of items in a test that an examinee will answer correctly and
this interpretation will have meaning regardless of the test performance
of other examinees. Of course, ability scores provide a basis for
norm-referenced interpretations as well.
Special Features of Latent Trait Models

When latent trait models fit particular data sets, three advantages
are obtained. Perhaps the most important advantage of latent trait models
is that, given a set of test items that have been fitted to a latent trait
model (that is, item parameters are known), it is possible to estimate an
examinee's ability on the same ability scale from any subset of items in
the domain of items that have been fitted to the model. (Of course, the
domain of items needs to be homogeneous in the sense of measuring a single
ability. If the domain of items is too heterogeneous, the ability estimates
will have little meaning.) In fact, regardless of the number of items
administered (as long as the number is not too small) or the statistical
characteristics of the items, the ability estimate for each examinee will
be an asymptotically unbiased estimate of true ability, provided the latent
trait model holds. Ability estimation independent of the particular
choice (and number) of items represents one of the major advantages of
latent trait models. Hence, latent trait models provide a way of comparing
examinees even though they may have taken quite different subsets of test
items. In latent trait models, the difficulty of items is accounted for
by the model and reflected in the ability estimates. Thus, two students
who receive identical scores on an easy and a difficult subset of the test
items, respectively, will differ in their ability estimates (the second
student will receive a higher estimated score than the first).
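
The two-student comparison can be sketched numerically. The example below
is only an illustration, assuming a two-parameter logistic model with
known item parameters and maximum likelihood estimation of ability. The
item parameters, the response pattern, and the function names are
hypothetical.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def ml_ability(responses, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Maximum likelihood ability estimate, found by bisection on the
    likelihood equation sum_g a_g * (u_g - P_g(theta)) = 0."""
    if all(responses) or not any(responses):
        # Perfect and zero scores have no finite maximum likelihood estimate.
        raise ValueError("no finite estimate for a perfect or zero score")

    def score(theta):
        return sum(a * (u - p_2pl(theta, a, b))
                   for u, (a, b) in zip(responses, items))

    # score() decreases in theta, so bisection brackets its single root.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (a, b) parameters: an easy subset and a difficult subset.
easy = [(1.0, -1.5), (1.0, -1.0), (1.0, -0.5), (1.0, 0.0)]
hard = [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 1.5)]
pattern = [1, 1, 1, 0]  # both students answer three of four items correctly

theta_easy = ml_ability(pattern, easy)
theta_hard = ml_ability(pattern, hard)
print(round(theta_easy, 3), round(theta_hard, 3))
```

Both students have the same number-correct score, but the student who took
the difficult subset receives the higher ability estimate because item
difficulty enters the likelihood equation.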

Another advantage of latent trait models is that item parameters
are invariant across sub-groups of examinees chosen from an examinee
population. In principle, item parameters should remain the same,
regardless of the subgroup tested. Invariant item parameters have been
sought by measurement specialists for a long time; their advantages are
obvious for test development work. Certainly classical item statistics,
such as item difficulty, will vary from group to group, depending upon the
average ability of the group being tested. This invariance property is
shown graphically in Figure 6.

Yet another advantage is that latent trait models provide a measure of
the precision of ability estimation at each ability level. Thus, instead
of providing a single standard error of measurement that applies to all
examinees, regardless of their test scores, latent trait models make it
possible to provide separate estimates of error for each examinee or for
each ability level.
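
A small numerical sketch may help to show what an error estimate at each
ability level looks like. It assumes the two-parameter logistic model, for
which the item information is (Da)^2 P(θ)[1 - P(θ)], and it takes the
standard error of the ability estimate as the reciprocal of the square
root of the summed information. The item parameters and function names are
hypothetical.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, D=1.7):
    """Item information under the assumed two-parameter logistic model."""
    p = p_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

def total_information(theta, items):
    """Information of the whole test: the sum of the item information values."""
    return sum(item_information(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Approximate standard error of the ability estimate at theta."""
    return 1.0 / math.sqrt(total_information(theta, items))

# Hypothetical (a, b) parameters for a short test centered near theta = 0.
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]

for theta in (-2.5, 0.0, 2.5):
    print(theta, round(standard_error(theta, items), 3))
```

With these items the error is smallest near the middle of the ability
scale and grows toward the extremes, which a single standard error of
measurement would not reveal.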

Examinee-free  item  statistics  are  especially  useful  in  "item 
banking"  and  criterion-referenced  test  development.  Item-free  ability 
estimates  permit  the  "tailoring"  of  tests  to  individuals  and  situations. 

The  concepts  of  information  and  efficiency  are  useful  in  both  test 
development  work  and  determination  of  precision  of  ability  score  estimates. 
Some  of  the  applications  will  be  considered  in  the  next  section  of  the 
paper. 

But also, it is now time to consider the price which must be paid
for the "goods" which are "delivered" via latent trait models. First,
the special features will only be obtained when there is a reasonably
close match between the researcher's latent trait model and his/her data.
How close? That question is currently under study by many researchers.
Second, it is unlikely that the features will be obtained with "short"
tests. Hard figures are difficult to come by but it would appear that
tests of 13 or more items are required. Also, sample sizes of 200 or
more examinees will be required to produce stable item statistics with
the one-parameter model and somewhat larger samples are required with the
two- and three-parameter logistic test models (Swaminathan & Gifford, 1979).

Figure 6. A diagram showing the independence of the shape of item
characteristic curves from the underlying ability distribution.

Other practical problems include (1) the training of practitioners to use
these models, and (2) the handling of examinees who get "rejected" because
their test scores are too high or too low.

Applications 


Item  Banking 

The development of criterion-referenced testing technology has
resulted in the increasing importance of item banking (Choppin, 1976).
An item bank is a collection of test items, "stored" with known item
characteristics and made available to test constructors. Depending on the
intended purpose of the test, items with described characteristics can be
drawn from the bank and used to construct a test with known properties.
Although classical item statistics (item difficulty and discrimination)
have been employed for this purpose, they are of limited value for
describing the items in a bank because these statistics are dependent on
the particular group used in the item calibration process. Latent trait
item parameters, however, do not have this limitation, and consequently
are of much greater use in describing test items in an item bank (Choppin,
1976; Wright, 1977). The invariance property of the latent trait item
parameters makes it possible to obtain item statistics that are comparable
across dissimilar groups. Let us assume that we are interested in
describing items using the two-parameter logistic test model. The single
drawback is that because the mean and standard deviation of the ability
scores are arbitrarily established, the ability score metric is different
for each group. Since the item parameters depend on the ability scale, it
is not possible to directly compare latent trait item parameters derived
from different groups of examinees until the ability scales are equated in
some way. Fortunately, the problem is not too hard to resolve since Lord
and Novick (1968) have shown that the item parameters in the two groups
are linearly related. Thus, if a subset of calibrated items is
administered to both groups, the linear relationship between the estimates
of the item parameters can be obtained by forming two separate bivariate
plots, one establishing the relationship between the estimates of the item
discrimination parameters for the two groups, and the second, the
relationship between the estimates of the item difficulty parameters.
Having established the linear relationship between item parameters common
to the two groups, a prediction equation can then be used to predict item
parameters for those items not administered to the first group. In this
way, all item parameters can be equated to a common group of examinees
and corresponding ability scale.
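
The prediction step can be sketched as follows. This is a minimal
illustration, assuming that an ordinary least-squares line is fitted to
each bivariate plot of common-item estimates; the text above does not
prescribe a particular fitting method. All parameter values and function
names are hypothetical, and real estimates would scatter about the fitted
lines rather than fall exactly on them.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(value, slope, intercept):
    """Place a group-2 estimate on the group-1 scale."""
    return slope * value + intercept

# Hypothetical estimates for four items administered to both groups.
b_group2 = [-1.0, 0.0, 1.0, 2.0]    # difficulty, group-2 scale
b_group1 = [-0.7, 0.5, 1.7, 2.9]    # difficulty, group-1 scale
a_group2 = [0.6, 1.2, 1.8, 0.9]     # discrimination, group-2 scale
a_group1 = [0.5, 1.0, 1.5, 0.75]    # discrimination, group-1 scale

b_slope, b_intercept = fit_line(b_group2, b_group1)
a_slope, a_intercept = fit_line(a_group2, a_group1)

# An item calibrated only in group 2 (hypothetical values), re-expressed
# on the group-1 scale with the two prediction equations.
new_b = predict(0.5, b_slope, b_intercept)
new_a = predict(1.2, a_slope, a_intercept)
print(round(new_b, 3), round(new_a, 3))
```

In this constructed example the difficulty slope and the discrimination
slope are reciprocals of one another, as expected when the two ability
scales differ only by a linear transformation.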

Test  Development 

The important differences between developing tests using standard
methods and methods based on latent trait theory occur during the
following steps: (a) item analysis, (b) selection of test items, and
(c) reliability assessment.

Item analysis techniques involve (1) the characterization of test
items and (2) the use of statistical information for revising and/or
deleting test items. The major problem with item statistics (item
difficulty and discrimination) derived from standard item analyses is
that they are sample dependent. This problem is overcome by characterizing
items in terms of latent trait parameters.

Latent trait theory not only provides the test developer with sample
invariant item parameters but also with a far more powerful method of item
selection (Birnbaum, 1968). This method involves the use of information
curves, i.e., items are selected depending upon the amount of information
they contribute to the total amount of information supplied by the test.
One of the useful features of item information curves is that the
contribution of each item to the test information function can be
determined without knowledge of the other items in the test. When standard
testing technology is applied the situation is very different. The
contribution of any item to such statistics as test reliability cannot be
determined independently of the characteristics of all the other items in
the test.

Lord (1977) outlined a procedure, originally presented by Birnbaum
(1968), for the use of item information curves in building a test to meet
any desired set of specifications. The procedure employs a pool of
calibrated items, with accompanying information curves, such as might be
obtained from the item banking methods previously described. The procedure
outlined by Lord consists of the following steps:

1. Decide on the shape of the desired test information curve.
Lord (1977) calls this the target information curve.

2. Select items with item information curves that will fill up
the hard-to-fill areas under the target information curve.

3. After each item is added to the test, calculate the test
information curve for the selected test items.

4. Continue selecting test items until the test information curve
approximates the target information curve to a satisfactory
degree.
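
The four steps can be sketched as a simple greedy routine. This is one
possible automation and is an assumption of this sketch: the procedure
itself leaves the choice of items in step 2 to the test developer's
judgment. The sketch assumes two-parameter logistic item information, a
target specified at a few ability points, and hypothetical item parameters
and function names.

```python
import math

def item_information(theta, a, b, D=1.7):
    """Item information under an assumed two-parameter logistic model."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def build_test(pool, thetas, target, max_items):
    """Greedy item selection against a target information curve."""
    selected = []
    current = [0.0] * len(thetas)      # test information curve so far
    remaining = list(pool)
    while remaining and len(selected) < max_items:
        # Unfilled area under the target at each ability point.
        gaps = [max(t - c, 0.0) for t, c in zip(target, current)]
        if max(gaps) == 0.0:           # step 4: the target has been reached
            break

        def filled(item):
            # Step 2: how much of the unfilled area this item would fill.
            a, b = item
            return sum(min(item_information(th, a, b), g)
                       for th, g in zip(thetas, gaps))

        best = max(remaining, key=filled)
        remaining.remove(best)
        selected.append(best)
        # Step 3: recompute the test information curve.
        current = [c + item_information(th, *best)
                   for th, c in zip(thetas, current)]
    return selected, current

# Step 1: a hypothetical target that asks for information only near theta = 1.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = [0.0, 0.0, 0.0, 2.0, 0.0]
pool = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]  # hypothetical (a, b)

selected, info = build_test(pool, thetas, target, max_items=3)
print(selected)
print([round(value, 3) for value in info])
```

Because item information is additive, each candidate item can be evaluated
against the remaining gap without reference to the items already chosen,
which is what makes a step-by-step routine of this kind possible.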




Examples of the application of this technique to the development of tests
for differing ranges of ability (based on simulated data) are given by
Cook and Hambleton (1979). Some results from their study are reported
in Figure 7.

An excellent discussion of item selection, as it pertains to tests
developed according to Rasch model procedures, is presented by Wright
and Douglas (1975) and Wright (1977). The item selection procedure
basically consists of specifying the ability distribution of the group
for whom the test is intended and then choosing items such that the
distribution of item difficulties matches the distribution of abilities.
This procedure is equivalent to that originally introduced by Birnbaum
(1968), since in this case the item information curves depend only on
the difficulty parameters.

In latent trait theory, test information curves replace the familiar
concepts of reliability and standard error of measurement. The use of the
test information curve as a measure of accuracy of estimation is appealing
for at least two reasons: (1) its shape depends only on the items included
in the test, and (2) it provides an estimate of the error of measurement
at each ability level.

Test  Score  Interpretations 

One primary use of a criterion-referenced test is to obtain an
estimate of an examinee's level of mastery (or "ability") on an objective.
Thus, a straightforward application of one of the latent trait models
(the assumption of unidimensionality would not likely be a problem) would
produce examinee ability scores. Among the advantages of this application
would be that items could be sampled (for example, at random) from an item
pool for each examinee, and all examinee ability estimates would be on a
common scale.


Figure 7. Test Information Curves Produced With Five Item Selection
Methods [30 Test Items] (axis label: Test Information)

Since item parameters are invariant across groups of examinees, it
would be possible to construct criterion-referenced tests to "discriminate"
at different levels of the ability continuum. Thus, a test developer
might select an "easier" set of test items for a pretest than a posttest,
and still be able to measure "examinee growth" by estimating examinee
ability at each test occasion on the same ability scale. This cannot
be done with classical approaches to test development and test score
interpretation. If we had a good idea of the likely range of ability
scores for the examinees, test items could be selected so as to maximize
the test information in the region of ability for the examinees being
tested. The optimum selection of test items would contribute substantially
to the precision with which ability scores were estimated. In the case of
criterion-referenced tests, it is common to observe lower test performance
on a pretest than on a posttest; therefore, the test constructor could
select the easier test items from the domain of items measuring an
objective for the pretest and more difficult items could be selected for
the posttest. This would enable the test constructor to maximize the
precision of measurement of each test in the region of ability where the
examinees would most likely be located. Of course, if the assumption
about the location of ability scores was not accurate, gains in precision
of measurement would not be obtained.

The results reported in Tables 1 to 4 show clearly the advantages
of "tailoring" a test to the ability level of a group. Of course, the
potential improvements depend on the validity of a test developer's
assumption about the examinee ability distribution. If he or she uses an
incorrect prior distribution as a basis for designing a test, the resulting
test will certainly not have the desired characteristics.

A second important use of criterion-referenced tests is to produce
examinee test scores that can be used to obtain "domain score estimates."
Much has already been made of the "item-free" ability estimates which are
derivable from latent trait models. However, while ability estimates
have the definite advantage of being "item-free," ability scores are
measured on a scale which appears to be far less useful to practitioners
than the domain score scale. After all, what does it mean to say, θ = 1.5?
Domain scores can be defined on the interval [0, 1] and provide
information about examinee levels of performance (proportions of content
mastered) relative to the objectives (described by domain specifications)
measured on the test. As long as the test items are a representative
sample of test items from the domain of items measuring an objective, the
associated "test characteristic curve" (or more correctly, the "score
characteristic curve") can be used to obtain domain score estimates from
ability score estimates. When a non-representative set of test items is
included in a test, examinee performance on the set of test items is used
to estimate examinee ability, and the score characteristic curve for the
total pool of calibrated test items measuring an objective is used to
estimate domain scores from ability scores.
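
The conversion from an ability estimate to a domain score estimate can be
sketched numerically. The example assumes a two-parameter logistic model
and takes the characteristic curve of a set of items to be the average of
their item characteristic curves. The item pool, the easy subset, the
ability value, and the function names are all hypothetical.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def characteristic_curve(theta, items):
    """Expected proportion correct on a set of items at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items) / len(items)

# Hypothetical calibrated pool measuring one objective, and a
# non-representative test made up of the two easiest items.
pool = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
easy_test = pool[:2]

theta_hat = 0.0  # hypothetical ability estimate obtained from the easy test

domain_score = characteristic_curve(theta_hat, pool)
test_score = characteristic_curve(theta_hat, easy_test)
print(round(domain_score, 3), round(test_score, 3))
```

The expected proportion correct on the easy test overstates the domain
score, which is why the curve for the total pool, rather than the curve
for the administered items, is the one used for the conversion.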

Figure 9 provides a graphical representation of the procedure
outlined.


Figure 9. A comparison between the test characteristic curves for
(1) the domain of items measuring an objective, and (2) the particular
items included in a test which measure the objective (axis label:
Domain Score).

I will briefly introduce one additional criterion-referenced testing
problem which probably can be resolved by using latent trait models and
concepts. It is common for instructors to change their tests from one
group of examinees to the next. This is often done to improve the tests,
to insure test security, to reflect minor adjustments in courses, and so
on. The problem is to insure that the standards of performance required
of students across the different
versions of a test are the same. The fact that a candidate must achieve
a specified test score on either test to receive a passing score does
not guarantee the equivalence of the two tests. For example, it may
turn out that one test is somewhat easier than the other. Required is
a method for "equating" scores from one test to another. Equating of test
scores will improve the usability of the derived scores for individual
interpretations and course evaluations. Equating of test scores on
norm-referenced tests has occupied a great deal of attention and much
useful work has been done. Currently, most test score equating is being
done via the use of latent trait models (the one- and three-parameter
logistic test models are the most popular). In fact, there is evidence
to suggest that latent trait model approaches to equating are often far
superior to classical methods. However, with criterion-referenced tests
we often have relatively short tests and modest numbers of examinees and
therefore latent trait model equating methods need to be developed for
use in this special testing situation. To date, equating studies have
often been done with rather large numbers of examinees and test items.
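
The general idea of carrying a passing score from one test form to another
can be sketched as follows. This is an illustration only and not a method
proposed here for short tests: it assumes that both forms have already
been calibrated on a common ability scale with a two-parameter logistic
model, and it ignores the estimation error that short tests and small
samples would introduce. The item parameters, the 0.70 cut score, and the
function names are hypothetical.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Expected proportion correct on a form at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items) / len(items)

def theta_at_score(proportion, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Ability at which the expected proportion correct equals `proportion`."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    # tcc() increases in theta, so bisection finds the matching ability.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < proportion:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def equate_cut_score(proportion_a, form_a, form_b):
    """Proportion correct on form B matching a cut score set on form A."""
    return tcc(theta_at_score(proportion_a, form_a), form_b)

# Hypothetical forms on a common scale; form B is uniformly harder.
form_a = [(1.0, b) for b in (-1.0, -0.5, 0.0, 0.5)]
form_b = [(1.0, b) for b in (-0.5, 0.0, 0.5, 1.0)]

cut_b = equate_cut_score(0.70, form_a, form_b)
print(round(cut_b, 3))
```

In this constructed case the equivalent passing score on the harder form
is lower than 0.70, which illustrates why requiring the same raw score on
both forms does not hold examinees to the same standard.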

Conclusions


Latent trait models and their application to educational testing and
measurement problems have been under study for about ten years now.
Certainly there are many problems requiring resolution but enough is known
about latent trait models to use them successfully in solving many testing
problems. With respect to the field of criterion-referenced testing, the
task as I see it is one of identifying those problems which can be handled
by latent trait model technology rather than whether or not the technology
should be used.

On  the  positive  side, 

1. Latent trait models appear to provide an excellent basis
for equating non-parallel forms of competency tests at
the district and state level.

2. Several useful computer programs exist to carry out
required analyses.

3. Several new textbooks and articles are now available to
the interested practitioner (Hambleton, Lord, Wright &
Stone, and Warm, to name four).

4. Other promising applications of latent trait models are in
the areas of adaptive testing, item bias, test development,
and test score interpretations. For example, Weiss and
his colleagues at the University of Minnesota have some
impressive results on the effects of adaptive testing in the
area of criterion-referenced testing. Bob Rentz and his
colleagues at Georgia State University are doing some
excellent work on the study of test score reporting systems.

On  the  other  hand, 

1. I see little reason to recommend the use of latent trait
models in daily classroom management of students. Latent
trait models will offer little more than a headache to
classroom teachers. Because (1) criterion-referenced
tests are typically short, (2) sample sizes are small
(although item banks may reduce the importance of this
factor), (3) the required time for training of teachers
in a new system of measurement would be extensive, and
(4) any gains in measurement precision that might accrue
would be marginal, I cannot recommend applications in this
particular area.

2. No data set will ever be fit perfectly by a model.
What is not known is how much misfit can be tolerated
by a model and still have any advantages of the model
hold in practice. Latent trait models are strong,
i.e., based on restrictive assumptions, and therefore
this general area requires considerably more research.


The  viability  of  latent  trait  models  for  test  development  work 
is  clear  but  more  effective  implementation  could  be  achieved  if  several 
questions  were  satisfactorily  answered: 

1. The choice of a model is one question. At the test
development stage, the practitioner has the option of developing
items to fit a specific latent trait model. It would greatly
facilitate the test development process if practical
guidelines existed that provided a logical basis for making this
choice.

2. A second question concerns the reason for item misfit. At
the present level of technical sophistication, the test
developer, faced with a misfitting item, can do little more
than subjectively examine the item, and hope that the reason
for misfit will be apparent.

3. The problem of determining whether or not a pool of items can
be considered unidimensional is an important one. Factor
analytical techniques are often used for this purpose but
there are problems (Hambleton et al., 1979; Lord & Novick, 1968).

4. One area of current interest involves the equating of a
criterion-referenced test to a norm-referenced test so that
CRT scores can be reported in terms of a norm-referenced
framework without actually carrying out a national norming
study. Such an equating study is often discussed within the
context of Title I evaluations. Legal issues aside, how best
to do the equating is not clear (for example, how large and
representative a sample of examinees is needed?) nor is the
minimum size of the correlation between the two sets of scores
which is needed to insure a stable equating known.

Numerous test developers are now considering the use of latent trait
models in their work. Hopefully this paper will provide some newcomers to
the area with a suitable introduction to the topic.